perf(service): Write HV tombstone before LT data to reduce orphan risk #365
Draft
Force-pushed 1ab7bfe to 9f5bb43
Deduplicates the expiration/mutation-building logic shared between `put_row` and `put_non_tombstone` into a single `write_mutations` method.

Co-Authored-By: Claude <noreply@anthropic.com>
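The shape of that deduplication might look like the following minimal sketch. All names except `put_row`, `put_non_tombstone`, and `write_mutations` are assumptions; the real code builds BigTable client mutations rather than this toy struct.

```rust
/// Illustrative stand-in for a BigTable mutation (fields are assumptions).
#[derive(Debug, PartialEq)]
struct Mutation {
    key: String,
    expires_at_micros: u64,
    payload: Vec<u8>,
}

/// Illustrative stand-in for the table wrapper.
struct Table {
    ttl_micros: u64,
    now_micros: u64,
}

impl Table {
    /// Single place that computes the expiration and assembles the mutation,
    /// so put_row and put_non_tombstone cannot drift apart.
    fn write_mutations(&self, key: &str, payload: Vec<u8>) -> Mutation {
        Mutation {
            key: key.to_string(),
            expires_at_micros: self.now_micros + self.ttl_micros,
            payload,
        }
    }

    fn put_row(&self, key: &str, payload: Vec<u8>) -> Mutation {
        self.write_mutations(key, payload)
    }

    fn put_non_tombstone(&self, key: &str, payload: Vec<u8>) -> Mutation {
        // Tombstone check elided here; only the shared mutation-building
        // path is shown.
        self.write_mutations(key, payload)
    }
}
```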
…tcome

Both types represented the same concept — an operation that either executed or was blocked by a redirect tombstone — with different variant names. Unify them into `ConditionalOutcome { Executed, Tombstone }`.

Co-Authored-By: Claude <noreply@anthropic.com>
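The unified type might look like this minimal sketch (the derives and visibility are assumptions; only the enum name and variants come from the commit message):

```rust
/// Outcome of a conditional storage operation: either the operation
/// executed, or a redirect tombstone at the key blocked it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConditionalOutcome {
    Executed,
    Tombstone,
}
```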
…ibility Keep the ConditionalOutcome rename from this branch while adopting the pub visibility from main.
Previously, large-object inserts followed LT-first ordering: write data to long-term storage, then write the redirect tombstone to high-volume. Concurrent inserts or pod kills between those two steps left an orphaned long-term object — data in LT with no tombstone in HV — permanently unreachable with no recovery path.
This flips the ordering to HV-first: write the tombstone first, then write the data. A failure between the two steps now produces a headless tombstone (tombstone in HV, no data in LT) instead. Headless tombstones are safe and self-healing: reads return `None`, deletes remove them, and re-inserts overwrite them.

The tradeoff is deliberate: we lower the risk of orphans in long-term storage — which are silent data leaks with no recovery — and instead accept headless tombstones, which are a well-defined, recoverable state.
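The ordering and the read-side behavior can be sketched as follows. This is a toy model, not the real service: plain `HashMap`s stand in for the HV and LT stores, a `None` cell stands in for the tombstone, and the crash flag simulates a pod kill between the two writes.

```rust
/// HV-first large-object insert: tombstone first, data second.
/// Function and store shapes are illustrative, not the real API.
fn insert_large_object(
    hv: &mut std::collections::HashMap<String, Option<Vec<u8>>>, // None = tombstone
    lt: &mut std::collections::HashMap<String, Vec<u8>>,
    key: &str,
    payload: Vec<u8>,
    crash_between_steps: bool, // simulate a pod kill between the two writes
) {
    // Step 1: write the redirect tombstone to high-volume storage first.
    hv.insert(key.to_string(), None);
    if crash_between_steps {
        // Worst case is now a headless tombstone, never an orphaned LT object.
        return;
    }
    // Step 2: write the data to long-term storage.
    lt.insert(key.to_string(), payload);
}

/// A read that hits a tombstone follows the redirect to long-term storage;
/// if the LT object is missing (headless tombstone), it safely returns None.
fn read(
    hv: &std::collections::HashMap<String, Option<Vec<u8>>>,
    lt: &std::collections::HashMap<String, Vec<u8>>,
    key: &str,
) -> Option<Vec<u8>> {
    match hv.get(key) {
        Some(Some(data)) => Some(data.clone()),
        Some(None) => lt.get(key).cloned(), // redirect; None if headless
        None => None,
    }
}
```

A re-insert of the same key simply overwrites the headless tombstone and completes both steps, which is the self-healing property described above.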
For small objects, a new `put_non_tombstone` trait method atomically rejects the write if a tombstone already exists at the key, routing the payload to long-term storage instead. BigTable implements this with `CheckAndMutateRowRequest`; other backends fall back to a non-atomic read-then-write.

One gap remains: a concurrent insert + delete can still race to produce an orphaned long-term object. Fixing that requires per-key serialization. This PR is an intermediate improvement; the full solution with no orphans at all is tracked separately.
Ref FS-236
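The trait-level shape of `put_non_tombstone` with its non-atomic default might look like the sketch below. The trait name, `Cell` representation, and `PutOutcome` type are assumptions; only `put_non_tombstone`, `put_row`, and the fallback semantics come from the description, and the real interface is presumably async.

```rust
use std::collections::HashMap;

/// Illustrative cell type: a row holds either data or a redirect tombstone.
#[derive(Debug, PartialEq)]
enum Cell {
    Data(Vec<u8>),
    Tombstone, // redirect to long-term storage
}

#[derive(Debug, PartialEq)]
enum PutOutcome {
    Written,
    BlockedByTombstone,
}

trait HighVolumeStore {
    fn get(&self, key: &str) -> Option<&Cell>;
    fn put_row(&mut self, key: &str, value: Vec<u8>);

    /// Default, non-atomic read-then-write fallback: refuse the write when a
    /// tombstone is present. BigTable overrides this with a single
    /// CheckAndMutateRowRequest so the check and the write are atomic.
    fn put_non_tombstone(&mut self, key: &str, value: Vec<u8>) -> PutOutcome {
        if matches!(self.get(key), Some(Cell::Tombstone)) {
            return PutOutcome::BlockedByTombstone;
        }
        self.put_row(key, value);
        PutOutcome::Written
    }
}

/// Toy in-memory backend that just takes the default implementation.
struct MemStore(HashMap<String, Cell>);

impl HighVolumeStore for MemStore {
    fn get(&self, key: &str) -> Option<&Cell> {
        self.0.get(key)
    }
    fn put_row(&mut self, key: &str, value: Vec<u8>) {
        self.0.insert(key.to_string(), Cell::Data(value));
    }
}
```

On a `BlockedByTombstone` result, the caller would route the payload to long-term storage instead, as the description states. The default implementation is where the race window lives: between `get` and `put_row`, a concurrent delete or insert can change the row, which the atomic BigTable override avoids.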